It is the prosper loan data set, that contains information about loan listings and related variables including borrower as well as lender information. It contains variables related to Borrower such as credit rating, prosper rating etc. Moreover, the dataset also has lender information.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import os
import glob
%matplotlib inline
#load the data
df = pd.read_csv('prosperLoanData.csv')
#explore the data
pd.set_option('display.max_columns', None)
df.head(10)
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CurrentlyInGroup | GroupKey | DateCreditPulled | CreditScoreRangeLower | CreditScoreRangeUpper | FirstRecordedCreditLine | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | LoanKey | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | MemberKey | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | CO | Other | Self-employed | 2.0 | True | True | NaN | 2007-08-26 18:41:46.780000000 | 640.0 | 659.0 | 2001-10-11 00:00:00 | 5.0 | 4.0 | 12.0 | 1 | 24.0 | 3.0 | 3.0 | 2.0 | 472.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.00 | 1500.0 | 11.0 | 0.81 | 0.0 | 0.17 | $25,000-49,999 | True | 3083.333333 | E33A3400205839220442E84 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 78 | 19141 | 9425 | 2007-09-12 00:00:00 | Q3 2007 | 1F3E3376408759268057EDA | 330.43 | 11396.14 | 9425.00 | 1971.14 | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | 0.07960 | 0.0249 | 0.05470 | 6.0 | A | 7.0 | 2 | CO | Professional | Employed | 44.0 | False | False | NaN | 2014-02-27 08:28:14 | 680.0 | 699.0 | 1996-03-18 00:00:00 | 14.0 | 14.0 | 29.0 | 13 | 389.0 | 3.0 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3989.0 | 0.21 | 10266.0 | 29.0 | 1.00 | 2.0 | 0.18 | $50,000-74,999 | True | 6125.000000 | 9E3B37071505919926B1D82 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 134815 | 10000 | 2014-03-03 00:00:00 | Q1 2014 | 1D13370546739025387B2F4 | 318.93 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | GA | Other | Not available | NaN | False | True | 783C3371218786870A73D20 | 2007-01-02 14:09:10.060000000 | 480.0 | 499.0 | 2002-07-27 00:00:00 | NaN | NaN | 3.0 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.06 | Not displayed | True | 2083.333333 | 6954337960046817851BCB2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 86 | 6466 | 3001 | 2007-01-17 00:00:00 | Q1 2007 | 5F7033715035555618FA612 | 123.32 | 4186.63 | 3001.00 | 1185.63 | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | 0.08490 | 0.0249 | 0.06000 | 6.0 | A | 9.0 | 16 | GA | Skilled Labor | Employed | 113.0 | True | False | NaN | 2012-10-22 11:02:32 | 800.0 | 819.0 | 1983-02-28 00:00:00 | 5.0 | 5.0 | 29.0 | 7 | 115.0 | 0.0 | 1.0 | 4.0 | 10056.0 | 14.0 | 0.0 | 0.0 | 1444.0 | 0.04 | 30754.0 | 26.0 | 0.76 | 0.0 | 0.15 | $25,000-49,999 | True | 2875.000000 | A0393664465886295619C51 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 16 | 77296 | 10000 | 2012-11-01 00:00:00 | Q4 2012 | 9ADE356069835475068C6D2 | 321.45 | 5143.20 | 4091.09 | 1052.11 | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | 0.18316 | 0.0925 | 0.09066 | 3.0 | D | 4.0 | 2 | MN | Executive | Employed | 44.0 | True | False | NaN | 2013-09-14 18:38:44 | 680.0 | 699.0 | 2004-02-20 00:00:00 | 19.0 | 19.0 | 49.0 | 6 | 220.0 | 1.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6193.0 | 0.81 | 695.0 | 39.0 | 0.95 | 2.0 | 0.26 | $100,000+ | True | 9583.333333 | A180369302188889200689E | 1.0 | 11.0 | 11.0 | 0.0 | 0.0 | 11000.0 | 9947.9 | NaN | 0 | NaN | 6 | 102670 | 15000 | 2013-09-20 00:00:00 | Q3 2013 | 36CE356043264555721F06C | 563.97 | 2819.85 | 1563.22 | 1256.63 | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| 5 | 0F05359734824199381F61D | 1074836 | 2013-12-14 08:26:37.093000000 | NaN | 60 | Current | NaN | 0.15425 | 0.1314 | 0.1214 | 0.11567 | 0.0449 | 0.07077 | 5.0 | B | 10.0 | 1 | NM | Professional | Employed | 82.0 | True | False | NaN | 2013-12-14 08:26:40 | 740.0 | 759.0 | 1973-03-01 00:00:00 | 21.0 | 17.0 | 49.0 | 13 | 1410.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 62999.0 | 0.39 | 86509.0 | 47.0 | 1.00 | 0.0 | 0.36 | $100,000+ | True | 8333.333333 | C3D63702273952547E79520 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 3 | 123257 | 15000 | 2013-12-24 00:00:00 | Q4 2013 | 874A3701157341738DE458F | 342.37 | 679.34 | 351.89 | 327.45 | -25.33 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 6 | 0F0A3576754255009D63151 | 750899 | 2013-04-12 09:52:56.147000000 | NaN | 36 | Current | NaN | 0.31032 | 0.2712 | 0.2612 | 0.23820 | 0.1275 | 0.11070 | 2.0 | E | 2.0 | 1 | KS | Sales - Retail | Employed | 172.0 | False | False | NaN | 2013-04-12 09:52:53 | 680.0 | 699.0 | 2000-09-29 00:00:00 | 10.0 | 7.0 | 20.0 | 6 | 214.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5812.0 | 0.72 | 1929.0 | 16.0 | 0.68 | 0.0 | 0.27 | $25,000-49,999 | True | 2083.333333 | CE963680102927767790520 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 11 | 88353 | 3000 | 2013-04-18 00:00:00 | Q2 2013 | AA4535764146102879D5959 | 122.67 | 1226.70 | 604.25 | 622.45 | -22.95 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 7 | 0F1035772717087366F9EA7 | 768193 | 2013-05-05 06:49:27.493000000 | NaN | 36 | Current | NaN | 0.23939 | 0.2019 | 0.1919 | 0.17830 | 0.0799 | 0.09840 | 4.0 | C | 4.0 | 2 | CA | Laborer | Employed | 103.0 | False | False | NaN | 2013-05-05 06:49:25 | 700.0 | 719.0 | 1999-02-25 00:00:00 | 6.0 | 6.0 | 10.0 | 5 | 101.0 | 3.0 | 16.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1260.0 | 0.13 | 2181.0 | 10.0 | 0.80 | 0.0 | 0.24 | $25,000-49,999 | True | 3355.750000 | 0C87368108902149313D53B | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 10 | 90051 | 10000 | 2013-05-13 00:00:00 | Q2 2013 | 737F347089545035681C074 | 372.60 | 3353.40 | 1955.89 | 1397.51 | -69.21 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 8 | 0F043596202561788EA13D5 | 1023355 | 2013-12-02 10:43:39.117000000 | NaN | 36 | Current | NaN | 0.07620 | 0.0629 | 0.0529 | 0.05221 | 0.0099 | 0.04231 | 7.0 | AA | 9.0 | 7 | IL | Food Service | Employed | 269.0 | True | False | NaN | 2013-12-02 10:43:39 | 820.0 | 839.0 | 1993-04-01 00:00:00 | 17.0 | 16.0 | 32.0 | 12 | 219.0 | 1.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9906.0 | 0.11 | 77696.0 | 29.0 | 1.00 | 1.0 | 0.25 | $25,000-49,999 | True | 3333.333333 | 02163700809231365A56A1C | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 3 | 121268 | 10000 | 2013-12-12 00:00:00 | Q4 2013 | 49A53699682291323D04D66 | 305.54 | 611.08 | 505.58 | 105.50 | -16.77 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 9 | 0F043596202561788EA13D5 | 1023355 | 2013-12-02 10:43:39.117000000 | NaN | 36 | Current | NaN | 0.07620 | 0.0629 | 0.0529 | 0.05221 | 0.0099 | 0.04231 | 7.0 | AA | 11.0 | 7 | IL | Food Service | Employed | 269.0 | True | False | NaN | 2013-12-02 10:43:39 | 820.0 | 839.0 | 1993-04-01 00:00:00 | 17.0 | 16.0 | 32.0 | 12 | 219.0 | 1.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9906.0 | 0.11 | 77696.0 | 29.0 | 1.00 | 1.0 | 0.25 | $25,000-49,999 | True | 3333.333333 | 02163700809231365A56A1C | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 3 | 121268 | 10000 | 2013-12-12 00:00:00 | Q4 2013 | 49A53699682291323D04D66 | 305.54 | 611.08 | 505.58 | 105.50 | -16.77 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
#get more info. about the data
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
#get familiar with data
df.describe()
| ListingNumber | Term | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperScore | ListingCategory (numeric) | EmploymentStatusDuration | CreditScoreRangeLower | CreditScoreRangeUpper | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | StatedMonthlyIncome | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.139370e+05 | 113937.000000 | 113912.000000 | 113937.000000 | 113937.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 113937.000000 | 106312.000000 | 113346.000000 | 113346.000000 | 106333.000000 | 106333.000000 | 113240.000000 | 113937.00000 | 113937.000000 | 113240.000000 | 112778.000000 | 113240.000000 | 106315.000000 | 112947.000000 | 113240.000000 | 106333.000000 | 1.063330e+05 | 106333.000000 | 106393.000000 | 106393.000000 | 106393.000000 | 106393.000000 | 105383.000000 | 1.139370e+05 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 18928.000000 | 113937.000000 | 16952.000000 | 113937.000000 | 113937.000000 | 113937.00000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 |
| mean | 6.278857e+05 | 40.830248 | 0.218828 | 0.192764 | 0.182701 | 0.168661 | 0.080306 | 0.096068 | 4.072243 | 5.950067 | 2.774209 | 96.071582 | 685.567731 | 704.567731 | 10.317192 | 9.260164 | 26.754539 | 6.96979 | 398.292161 | 1.435085 | 5.584405 | 0.592052 | 984.507059 | 4.154984 | 0.312646 | 0.015094 | 1.759871e+04 | 0.561309 | 11210.225447 | 23.230034 | 0.885897 | 0.802327 | 0.275947 | 5.608026e+03 | 1.421100 | 22.934345 | 22.271949 | 0.613629 | 0.048540 | 8472.311961 | 2930.313906 | -3.223214 | 152.816539 | 16.268464 | 31.896882 | 69444.474271 | 8337.01385 | 272.475783 | 4183.079489 | 3105.536588 | 1077.542901 | -54.725641 | -14.242698 | 700.446342 | 681.420499 | 25.142686 | 0.998584 | 0.048027 | 0.023460 | 16.550751 | 80.475228 |
| std | 3.280762e+05 | 10.436212 | 0.080364 | 0.074818 | 0.074516 | 0.068467 | 0.046764 | 0.030403 | 1.673227 | 2.376501 | 3.996797 | 94.480605 | 66.458275 | 66.458275 | 5.457866 | 5.022644 | 13.637871 | 4.63097 | 447.159711 | 2.437507 | 6.429946 | 1.978707 | 7158.270157 | 10.160216 | 0.727868 | 0.154092 | 3.293640e+04 | 0.317918 | 19818.361309 | 11.871311 | 0.148179 | 1.097637 | 0.551759 | 7.478497e+03 | 0.764042 | 19.249584 | 18.830425 | 2.446827 | 0.556285 | 7395.507650 | 3806.635075 | 50.063567 | 466.320254 | 9.005898 | 29.974184 | 38930.479610 | 6245.80058 | 192.697812 | 4790.907234 | 4069.527670 | 1183.414168 | 60.675425 | 109.232758 | 2388.513831 | 2357.167068 | 275.657937 | 0.017919 | 0.332353 | 0.232412 | 294.545422 | 103.239020 |
| min | 4.000000e+00 | 12.000000 | 0.006530 | 0.000000 | -0.010000 | -0.182700 | 0.004900 | -0.182700 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 19.000000 | 0.000000 | 0.000000 | 2.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -209.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1000.00000 | 0.000000 | -2.349900 | 0.000000 | -2.349900 | -664.870000 | -9274.750000 | -94.200000 | -954.550000 | 0.000000 | 0.700000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 4.009190e+05 | 36.000000 | 0.156290 | 0.134000 | 0.124200 | 0.115670 | 0.042400 | 0.074080 | 3.000000 | 4.000000 | 1.000000 | 26.000000 | 660.000000 | 679.000000 | 7.000000 | 6.000000 | 17.000000 | 4.00000 | 114.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.121000e+03 | 0.310000 | 880.000000 | 15.000000 | 0.820000 | 0.000000 | 0.140000 | 3.200333e+03 | 1.000000 | 9.000000 | 9.000000 | 0.000000 | 0.000000 | 3500.000000 | 0.000000 | -35.000000 | 0.000000 | 9.000000 | 6.000000 | 37332.000000 | 4000.00000 | 131.620000 | 1005.760000 | 500.890000 | 274.870000 | -73.180000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 6.005540e+05 | 36.000000 | 0.209760 | 0.184000 | 0.173000 | 0.161500 | 0.072400 | 0.091700 | 4.000000 | 6.000000 | 1.000000 | 67.000000 | 680.000000 | 699.000000 | 10.000000 | 9.000000 | 25.000000 | 6.00000 | 271.000000 | 1.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.549000e+03 | 0.600000 | 4100.000000 | 22.000000 | 0.940000 | 0.000000 | 0.220000 | 4.666667e+03 | 1.000000 | 16.000000 | 15.000000 | 0.000000 | 0.000000 | 6000.000000 | 1626.550000 | -3.000000 | 0.000000 | 14.000000 | 21.000000 | 68599.000000 | 6500.00000 | 217.740000 | 2583.830000 | 1587.500000 | 700.840100 | -34.440000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 44.000000 |
| 75% | 8.926340e+05 | 36.000000 | 0.283810 | 0.250000 | 0.240000 | 0.224300 | 0.112000 | 0.116600 | 5.000000 | 8.000000 | 3.000000 | 137.000000 | 720.000000 | 739.000000 | 13.000000 | 12.000000 | 35.000000 | 9.00000 | 525.000000 | 2.000000 | 7.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.952100e+04 | 0.840000 | 13180.000000 | 30.000000 | 1.000000 | 1.000000 | 0.320000 | 6.825000e+03 | 2.000000 | 33.000000 | 32.000000 | 0.000000 | 0.000000 | 11000.000000 | 4126.720000 | 25.000000 | 0.000000 | 22.000000 | 65.000000 | 101901.000000 | 12000.00000 | 371.580000 | 5548.400000 | 4000.000000 | 1458.540000 | -13.920000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 115.000000 |
| max | 1.255725e+06 | 60.000000 | 0.512290 | 0.497500 | 0.492500 | 0.319900 | 0.366000 | 0.283700 | 7.000000 | 11.000000 | 20.000000 | 755.000000 | 880.000000 | 899.000000 | 59.000000 | 54.000000 | 136.000000 | 51.00000 | 14985.000000 | 105.000000 | 379.000000 | 83.000000 | 463881.000000 | 99.000000 | 38.000000 | 20.000000 | 1.435667e+06 | 5.950000 | 646285.000000 | 126.000000 | 1.000000 | 20.000000 | 10.010000 | 1.750003e+06 | 8.000000 | 141.000000 | 141.000000 | 42.000000 | 21.000000 | 72499.000000 | 23450.950000 | 286.000000 | 2704.000000 | 44.000000 | 100.000000 | 136486.000000 | 35000.00000 | 2251.510000 | 40702.390000 | 35000.000000 | 15617.030000 | 32.060000 | 0.000000 | 25000.000000 | 25000.000000 | 21117.900000 | 1.012500 | 39.000000 | 33.000000 | 25000.000000 | 1189.000000 |
# check for null values
df.isnull().sum()
ListingKey 0
ListingNumber 0
ListingCreationDate 0
CreditGrade 84984
Term 0
...
PercentFunded 0
Recommendations 0
InvestmentFromFriendsCount 0
InvestmentFromFriendsAmount 0
Investors 0
Length: 81, dtype: int64
# check for duplicates
df.duplicated().sum()
0
their is no duplicates in the dataframe
# make a copy from the date to clean and work with it
loan = df.copy()
# remove the null values raws in APR
loan = loan[loan['BorrowerAPR'].isnull() == False]
loan.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 113912 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113912 non-null object 1 ListingNumber 113912 non-null int64 2 ListingCreationDate 113912 non-null object 3 CreditGrade 28928 non-null object 4 Term 113912 non-null int64 5 LoanStatus 113912 non-null object 6 ClosedDate 55064 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113912 non-null float64 9 LenderYield 113912 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113912 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113912 non-null bool 22 CurrentlyInGroup 113912 non-null bool 23 GroupKey 13323 non-null object 24 DateCreditPulled 113912 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113912 non-null int64 32 OpenRevolvingMonthlyPayment 113912 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105358 non-null float64 47 IncomeRange 113912 non-null object 48 IncomeVerifiable 113912 non-null bool 49 StatedMonthlyIncome 113912 non-null float64 50 LoanKey 113912 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113912 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113912 non-null int64 62 LoanNumber 113912 non-null int64 63 LoanOriginalAmount 113912 non-null int64 64 LoanOriginationDate 113912 non-null object 65 LoanOriginationQuarter 113912 non-null object 66 MemberKey 113912 non-null object 67 MonthlyLoanPayment 113912 non-null float64 68 LP_CustomerPayments 113912 non-null float64 69 LP_CustomerPrincipalPayments 113912 non-null float64 70 LP_InterestandFees 113912 non-null float64 71 LP_ServiceFees 113912 non-null float64 72 LP_CollectionFees 113912 non-null float64 73 LP_GrossPrincipalLoss 113912 non-null float64 74 LP_NetPrincipalLoss 113912 non-null float64 75 LP_NonPrincipalRecoverypayments 113912 non-null float64 76 PercentFunded 113912 non-null float64 77 Recommendations 113912 non-null int64 78 InvestmentFromFriendsCount 113912 non-null int64 79 InvestmentFromFriendsAmount 113912 non-null float64 80 Investors 113912 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 69.0+ MB
# correct the range of DebtToIncomeRatio
loan.DebtToIncomeRatio = loan.DebtToIncomeRatio.abs()
loan = loan.query('DebtToIncomeRatio <= 3')
# correct the ProsperScore
loan = loan.query('ProsperScore <= 10')
# clean incorrect data in BankcardUtilization
loan = loan.query('BankcardUtilization <= 1')
# drop inefficient coulmns
loan = loan.drop(['ListingKey','ListingNumber','ListingCreationDate','ClosedDate','GroupKey', 'CurrentlyInGroup', 'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines','InquiriesLast6Months', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'LoanKey', 'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing', 'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber', 'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginationDate', 'MemberKey', 'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount' ],axis=1)
loan.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 75446 entries, 1 to 113936 Data columns (total 51 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CreditGrade 0 non-null object 1 Term 75446 non-null int64 2 LoanStatus 75446 non-null object 3 BorrowerAPR 75446 non-null float64 4 BorrowerRate 75446 non-null float64 5 LenderYield 75446 non-null float64 6 EstimatedEffectiveYield 75446 non-null float64 7 EstimatedLoss 75446 non-null float64 8 EstimatedReturn 75446 non-null float64 9 ProsperRating (numeric) 75446 non-null float64 10 ProsperRating (Alpha) 75446 non-null object 11 ProsperScore 75446 non-null float64 12 ListingCategory (numeric) 75446 non-null int64 13 BorrowerState 75446 non-null object 14 Occupation 74139 non-null object 15 EmploymentStatus 75446 non-null object 16 EmploymentStatusDuration 75432 non-null float64 17 IsBorrowerHomeowner 75446 non-null bool 18 TotalCreditLinespast7years 75446 non-null float64 19 OpenRevolvingAccounts 75446 non-null int64 20 OpenRevolvingMonthlyPayment 75446 non-null float64 21 TotalInquiries 75446 non-null float64 22 CurrentDelinquencies 75446 non-null float64 23 AmountDelinquent 75446 non-null float64 24 DelinquenciesLast7Years 75446 non-null float64 25 PublicRecordsLast10Years 75446 non-null float64 26 BankcardUtilization 75446 non-null float64 27 AvailableBankcardCredit 75446 non-null float64 28 TotalTrades 75446 non-null float64 29 TradesNeverDelinquent (percentage) 75446 non-null float64 30 TradesOpenedLast6Months 75446 non-null float64 31 DebtToIncomeRatio 75446 non-null float64 32 IncomeRange 75446 non-null object 33 IncomeVerifiable 75446 non-null bool 34 StatedMonthlyIncome 75446 non-null float64 35 TotalProsperLoans 16639 non-null float64 36 TotalProsperPaymentsBilled 16639 non-null float64 37 OnTimeProsperPayments 16639 non-null float64 38 ProsperPaymentsLessThanOneMonthLate 16639 non-null float64 39 ProsperPaymentsOneMonthPlusLate 16639 non-null float64 40 LoanOriginalAmount 75446 non-null int64 41 LoanOriginationQuarter 75446 non-null object 42 MonthlyLoanPayment 75446 non-null float64 43 LP_CustomerPayments 75446 non-null float64 44 LP_CustomerPrincipalPayments 75446 non-null float64 45 LP_InterestandFees 75446 non-null float64 46 LP_ServiceFees 75446 non-null float64 47 LP_NonPrincipalRecoverypayments 75446 non-null float64 48 PercentFunded 75446 non-null float64 49 Recommendations 75446 non-null int64 50 Investors 75446 non-null int64 dtypes: bool(2), float64(35), int64(6), object(8) memory usage: 28.9+ MB
The Prosper loan dataset contains 113,937 observations of 81 variables. The observations refer to loan listings on Prosper.com from late 2005 until 2014, and various characteristics of those loans. The data seems “tidy,” according to Hadley Wickham’s definition: the variable names are not variables themselves, so there is not much work required in the way of “tidying” the data.
The histogram of Loan Origination Quarter shows a big dip in listings from Q4 2008 into 2009-10. This time period coincides with the (A) collapse of Lehman Brothers and the ensuing fallout in the global financial system, and (B) Prosper’s decision to register with the SEC. It appears that some combination of A and B caused Prosper to change how it does business. It will be interesting to take a look at how Prosper’s business changed over time.
I'm most interested in figureing out what features are best for predicting the borrower's Annual Percentage Rate (APR) for the loan.
My main investigation will be into how Prosper’s credit policies and borrower characteristics changed over time as its business model and the larger lending environment evolved. So having the Loan Origination Quarter is critical. And being able to see how much Prosper’s lenders earned or lost on each loan with the Payments, Fees, and Loss fields could prove helpful in illuminating what is going on with these loans.
I expect that the total loan amount will have a negative effect on the APR of the loan: the larger the total loan amount, the lower the APR. I also think that the borrowers stated monthly income, loan term, Prosper rating, employment status will also have effects on the APR.
In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.
loan.head()
| CreditGrade | Term | LoanStatus | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | LoanOriginalAmount | LoanOriginationQuarter | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | 36 | Current | 0.12016 | 0.0920 | 0.0820 | 0.07960 | 0.0249 | 0.05470 | 6.0 | A | 7.0 | 2 | CO | Professional | Employed | 44.0 | False | 29.0 | 13 | 389.0 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.21 | 10266.0 | 29.0 | 1.00 | 2.0 | 0.18 | $50,000-74,999 | True | 6125.000000 | NaN | NaN | NaN | NaN | NaN | 10000 | Q1 2014 | 318.93 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 1 |
| 3 | NaN | 36 | Current | 0.12528 | 0.0974 | 0.0874 | 0.08490 | 0.0249 | 0.06000 | 6.0 | A | 9.0 | 16 | GA | Skilled Labor | Employed | 113.0 | True | 29.0 | 7 | 115.0 | 1.0 | 4.0 | 10056.0 | 14.0 | 0.0 | 0.04 | 30754.0 | 26.0 | 0.76 | 0.0 | 0.15 | $25,000-49,999 | True | 2875.000000 | NaN | NaN | NaN | NaN | NaN | 10000 | Q4 2012 | 321.45 | 5143.20 | 4091.09 | 1052.11 | -108.01 | 0.0 | 1.0 | 0 | 158 |
| 4 | NaN | 36 | Current | 0.24614 | 0.2085 | 0.1985 | 0.18316 | 0.0925 | 0.09066 | 3.0 | D | 4.0 | 2 | MN | Executive | Employed | 44.0 | True | 49.0 | 6 | 220.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.81 | 695.0 | 39.0 | 0.95 | 2.0 | 0.26 | $100,000+ | True | 9583.333333 | 1.0 | 11.0 | 11.0 | 0.0 | 0.0 | 15000 | Q3 2013 | 563.97 | 2819.85 | 1563.22 | 1256.63 | -60.27 | 0.0 | 1.0 | 0 | 20 |
| 5 | NaN | 60 | Current | 0.15425 | 0.1314 | 0.1214 | 0.11567 | 0.0449 | 0.07077 | 5.0 | B | 10.0 | 1 | NM | Professional | Employed | 82.0 | True | 49.0 | 13 | 1410.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.39 | 86509.0 | 47.0 | 1.00 | 0.0 | 0.36 | $100,000+ | True | 8333.333333 | NaN | NaN | NaN | NaN | NaN | 15000 | Q4 2013 | 342.37 | 679.34 | 351.89 | 327.45 | -25.33 | 0.0 | 1.0 | 0 | 1 |
| 6 | NaN | 36 | Current | 0.31032 | 0.2712 | 0.2612 | 0.23820 | 0.1275 | 0.11070 | 2.0 | E | 2.0 | 1 | KS | Sales - Retail | Employed | 172.0 | False | 20.0 | 6 | 214.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.72 | 1929.0 | 16.0 | 0.68 | 0.0 | 0.27 | $25,000-49,999 | True | 2083.333333 | NaN | NaN | NaN | NaN | NaN | 3000 | Q2 2013 | 122.67 | 1226.70 | 604.25 | 622.45 | -22.95 | 0.0 | 1.0 | 0 | 1 |
# plotting the BorrowerAPR
plt.figure(figsize=[8, 8])
plt.suptitle('The BorrowerAPR ditribution');
bins = np.arange(0, loan.BorrowerAPR.max()+0.02,0.02)
plt.hist(data=loan, x='BorrowerAPR', bins=bins);
plt.xlim(0,0.45);
interest rates are pretty high, averaging 19.28%. There are also material spikes at higher interest rates over 30%. Since most of these loans are for debt consolidation, these high rates must still be better than the interest rates the borrowers would have to pay to credit card companies. Or, perhaps the borrowers see value in freeing up the borrowing limit on their credit cards, even if it means paying high rates to Prosper lenders.
# plotting the LoanOriginalAmount
plt.figure(figsize=[8, 8])
plt.suptitle('The LoanOriginalAmount ditribution');
bins = np.arange(0, loan.LoanOriginalAmount.max()+1000,1000)
plt.hist(data=loan, x='LoanOriginalAmount', bins=bins);
The loan amounts skew relatively small, with the median of 6,500 less than the mean of 8,337. Based on the past two histograms, a typical loan is for consolidating less than 10,000 of debt (probably credit card debt).
# ploting OnTimeProsperPayments distribution
plt.figure(figsize=[8, 8])
plt.suptitle('The OnTimeProsperPayments ditribution');
bins = np.arange(0, loan.OnTimeProsperPayments.max()+5,5)
plt.hist(data=loan, x='OnTimeProsperPayments', bins=bins);
plt.xlim(0,150);
OnTimeProsperPayments ditribution between 0 to 70.
# ploting MonthlyLoanPayment distribution
plt.figure(figsize=[8, 8])
plt.suptitle('The MonthlyLoanPayment ditribution');
bins = np.arange(0, loan.MonthlyLoanPayment.max()+50,50)
plt.hist(data=loan, x='MonthlyLoanPayment', bins=bins);
MonthlyLoanPayment distribution is between 100 to 700, the highest distribution in 200.
# ploting DebtToIncomeRatio distribution
plt.figure(figsize=[8, 8])
plt.suptitle('The DebtToIncomeRatio ditribution');
bins = np.arange(0, loan.DebtToIncomeRatio.max()+0.2,0.2)
plt.hist(data=loan, x='DebtToIncomeRatio', bins=bins);
The vast majority of Debt-to-Income Ratios are less than 0.5. Excluding the outliers on the high side on the right, the data is close to having the shape of a normal distribution, though it is skewed slightly left.
# ploting LoanOriginationQuarter distribution
plt.figure(figsize=[11, 8])
quarter = ['Q1 2007','Q2 2007','Q3 2007','Q4 2007','Q1 2008','Q2 2008','Q3 2008','Q4 2008','Q1 2009','Q2 2009','Q3 2009','Q4 2009','Q1 2010','Q2 2010','Q3 2010','Q4 2010','Q1 2011','Q2 2011','Q3 2011','Q4 2011','Q1 2012','Q2 2012','Q3 2012','Q4 2012','Q1 2013','Q2 2013','Q3 2013','Q4 2013','Q1 2014']
Qcat = pd.api.types.CategoricalDtype(ordered=True, categories=quarter)
df['LoanOriginationQuarter'] = df['LoanOriginationQuarter'].astype(Qcat)
default_color = sb.color_palette()[0]
sb.countplot(data = df, x = 'LoanOriginationQuarter', color = default_color)
plt.suptitle('The LoanOriginationQuarter ditribution');
plt.xticks(rotation=90);
This time frame coincides with the collapse of Lehman Brothers and the subsequent fallout in the global financial system from Q4 2008 to 2009. It took almost four years before the listing rate hit the levels of Q2 2008. Although Prosper is an alternative to conventional loan models, its company appears not to have been resistant to the global economic crisis. I'm wondering now if this is an economic crisis? induced Thrive to alter the way it does business. Perhaps, before the financial crisis, Prosper's credit policies were much looser? I'm going to set the issue away for later. It appears, after all, that only in July 2009 did Prosper create its Prosper Rating and Prosper Ranking.
# plotting the Term
plt.figure(figsize=[8, 8])
default_color = sb.color_palette()[0]
sb.countplot(data = loan, x = 'Term', color = default_color)
plt.suptitle('The Term ditribution');
Loan terms are either 12, 36, or 60 months, with the vast majority being 36-months.
df.CreditGrade.value_counts()
C 5649 D 5153 B 4389 AA 3509 HR 3508 A 3315 E 3289 NC 141 Name: CreditGrade, dtype: int64
# plotting the CreditGrade
plt.figure(figsize=[8, 8])
default_color = sb.color_palette()[0]
rquarter = ['AA','A','B','C','D','E','HR','NC']
bcat = pd.api.types.CategoricalDtype(ordered=True, categories=rquarter)
df['CreditGrade'] = df['CreditGrade'].astype(bcat)
sb.countplot(data = df, x = 'CreditGrade', color = default_color)
plt.suptitle('The CreditGrade ditribution');
The most redundent credite grade in the data is C and D and the other nearly same distribution and least category is NC
# plotting the LoanStatus
plt.figure(figsize=[8, 8])
default_color = sb.color_palette()[0]
sb.countplot(data = loan, x = 'LoanStatus', color = default_color)
plt.suptitle('The LoanStatus ditribution');
plt.xticks(rotation=90);
Overall, it appears that a large majority of loans are either Completed or Current, though there are also a large number of Charged-off and Defaulted (non-performing). A little later I will want to look at loan performance based on origination vintage.
# plotting the ProsperScore
plt.figure(figsize=[8, 8])
default_color = sb.color_palette()[0]
sb.countplot(data = loan, x = 'ProsperScore', color = default_color)
plt.suptitle('The ProsperScore ditribution');
This field is more or less normally distributed. We can conclude that, overall, risk ratings are relatively normal across the sample.
# plotting the BorrowerState
plt.figure(figsize=[11, 8])
default_color = sb.color_palette()[0]
sb.countplot(data = loan, x = 'BorrowerState', color = default_color)
plt.suptitle('The BorrowerState ditribution');
plt.xticks(rotation=90);
The most Borrower states is CA.
# plotting the IncomeVerifiable
plt.figure(figsize=[8, 8])
default_color = sb.color_palette()[0]
sb.countplot(data = loan, x = 'IncomeVerifiable', color = default_color)
plt.suptitle('The IncomeVerifiable ditribution');
Most of IncomeVerifiable is True
# plotting the IsBorrowerHomeowner
plt.figure(figsize=[8, 8])
default_color = sb.color_palette()[0]
sb.countplot(data = loan, x = 'IsBorrowerHomeowner', color = default_color)
plt.suptitle('The IsBorrowerHomeowner ditribution');
The data is normaly distributed in IsBorrowerHomeowner
# plotting the Occupation
plt.figure(figsize=[15, 8])
default_color = sb.color_palette()[0]
freq = loan.Occupation.value_counts().index
sb.countplot(data = loan, x = 'Occupation', color = default_color, order=freq)
plt.suptitle('The Occupation ditribution');
plt.xticks(rotation=90);
he majority put “Other” for their occupation. That is not too helpful. The largest non-Other category is “Professional,” which is also another unhelpful, generic catch-all. But perhaps we can compare the loan performance of some of the other borrower professions down the road.
From the brevious analysis I found that:
The distributions of stated monthly income is highly right screwed. Most stated monthly incomes are less than 30k, but some of them are incredibly high, like greater than 100k. Surprisingly, most of borrowers with greater than 100k monthly income only loan less than 5k dollars. So, the very large stated monthly income may be made up. Overall, Less than 0.3 percent borrowers have stated monthly income greater than 30k, these can be seemed as outlier for the following analysis, so it is better to remove borrower records with income greater than 30k.
The majority of loans are actually current loans. Since our main goal is to define driving factors of outcome of loan we are not interested in any current loans (and loans with specified past due period) also chargedoff loans can be considered as defaulted.
# save the clean data to a .csv file
loan.to_csv('loan_clean.csv', index=False)
In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).
plt.figure(figsize=[11, 11]);
plt.suptitle('BorrowerARP with Borrower credit information');
# BorrowerAPR vs CreditGrade
plt.subplot(2, 2, 1);
base_color = sb.color_palette()[0]
sb.boxplot(data=df, x='CreditGrade', y='BorrowerAPR', color=base_color);
plt.xlabel('CreditGrade');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs ProsperScore
plt.subplot(2, 2, 2);
sb.boxplot(data=loan, x='ProsperScore', y='BorrowerAPR', color=base_color);
plt.xlabel('ProsperScore');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs TotalCreditLinespast7years
plt.subplot(2, 2, 3);
sb.scatterplot(data=loan, x='TotalCreditLinespast7years', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('TotalCreditLinespast7years');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs BankcardUtilization
plt.subplot(2, 2, 4);
sb.scatterplot(data=loan, x='BankcardUtilization', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('BankcardUtilization');
plt.ylabel('BorrowerAPR');
As the credit grade decreases the Borrower APR increases, and the narrower in distribution decreases. In the prosper score as it increases the Borrower APR decreases.
plt.figure(figsize=[20, 25]);
plt.suptitle('BorrowerARP with Borrower personal information');
# BorrowerAPR vs BorrowerState
plt.subplot(4, 2, 1);
base_color = sb.color_palette()[0]
sb.boxplot(data=loan, x='BorrowerState', y='BorrowerAPR', color=base_color);
plt.xlabel('BorrowerState');
plt.ylabel('BorrowerAPR');
plt.xticks(rotation=90);
# BorrowerAPR vs EmploymentStatus
plt.subplot(4, 2, 2);
base_color = sb.color_palette()[0]
sb.boxplot(data=loan, x='EmploymentStatus', y='BorrowerAPR', color=base_color);
plt.xlabel('EmploymentStatus');
plt.ylabel('BorrowerAPR');
plt.xticks(rotation=10);
# BorrowerAPR vs IsBorrowerHomeowner
plt.subplot(4, 2, 3);
base_color = sb.color_palette()[0]
sb.boxplot(data=loan, x='IsBorrowerHomeowner', y='BorrowerAPR', color=base_color);
plt.xlabel('IsBorrowerHomeowner');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs OpenRevolvingMonthlyPayment
plt.subplot(4, 2, 4);
sb.scatterplot(data=loan, x='OpenRevolvingMonthlyPayment', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('OpenRevolvingMonthlyPayment');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs TradesOpenedLast6Months
plt.subplot(4, 2, 5);
sb.scatterplot(data=loan, x='TradesOpenedLast6Months', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('TradesOpenedLast6Months');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs MonthlyLoanPayment
plt.subplot(4, 2, 6);
sb.scatterplot(data=loan, x='MonthlyLoanPayment', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('MonthlyLoanPayment');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs DebtToIncomeRatio
plt.subplot(4, 2, 7);
sb.scatterplot(data=loan, x='DebtToIncomeRatio', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('DebtToIncomeRatio');
plt.ylabel('BorrowerAPR');
plt.xlim(0, 1);
Not quite information in this analysis except the borrower who hasn't have home slitly have a high borrower APR
plt.figure(figsize=[11, 11]);
plt.suptitle('BorrowerARP with Borrower credit information');
# BorrowerAPR vs TotalInquiries
plt.subplot(2, 2, 1);
sb.scatterplot(data=loan, x='TotalInquiries', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('TotalInquiries');
plt.ylabel('BorrowerAPR');
plt.xlim(0, 75);
# BorrowerAPR vs DelinquenciesLast7Years
plt.subplot(2, 2, 2);
sb.scatterplot(data=loan, x='DelinquenciesLast7Years', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('DelinquenciesLast7Years');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs PublicRecordsLast10Years
plt.subplot(2, 2, 3);
sb.scatterplot(data=loan, x='PublicRecordsLast10Years', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('PublicRecordsLast10Years');
plt.ylabel('BorrowerAPR');
plt.xlim(0, 30);
As the public records and Delinquencies increases the Borrower APR slitly decreses, but still not an effecient information.
plt.figure(figsize=[11, 11]);
plt.suptitle('BorrowerARP with lender information');
# BorrowerAPR vs LP_ServiceFees
plt.subplot(2, 2, 1);
sb.scatterplot(data=loan, x='LP_ServiceFees', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('LP_ServiceFees');
plt.ylabel('BorrowerAPR');
plt.xlim(-500,0);
# BorrowerAPR vs Recommendations
plt.subplot(2, 2, 2);
sb.scatterplot(data=loan, x='Recommendations', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('Recommendations');
plt.ylabel('BorrowerAPR');
plt.xlim(-1,10)
# BorrowerAPR vs Investors
plt.subplot(2, 2, 3);
sb.scatterplot(data=loan, x='Investors', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('Investors');
plt.ylabel('BorrowerAPR');
As the number of investors increases the Borrower APR decreases.
plt.figure(figsize=[11, 11]);
plt.suptitle('BorrowerARP with payment information');
# BorrowerAPR vs OnTimeProsperPayments
plt.subplot(2, 2, 1);
sb.scatterplot(data=loan, x='OnTimeProsperPayments', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('OnTimeProsperPayments');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs ProsperPaymentsLessThanOneMonthLate
plt.subplot(2, 2, 2);
sb.scatterplot(data=loan, x='ProsperPaymentsLessThanOneMonthLate', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('ProsperPaymentsLessThanOneMonthLate');
plt.ylabel('BorrowerAPR');
# BorrowerAPR vs ProsperPaymentsOneMonthPlusLate
plt.subplot(2, 2, 3);
sb.scatterplot(data=loan, x='ProsperPaymentsOneMonthPlusLate', y='BorrowerAPR', alpha= 0.2);
plt.xlabel('ProsperPaymentsOneMonthPlusLate');
plt.ylabel('BorrowerAPR');
There is no additional information in this analysis
plt.figure(figsize=[11, 6]);
plt.suptitle('ProsperScore with Borrower credit information');
# ProsperScore vs TotalCreditLinespast7years
plt.subplot(1, 2, 1);
sb.scatterplot(data=loan, x='TotalCreditLinespast7years', y='ProsperScore', alpha= 0.2);
plt.xlabel('TotalCreditLinespast7years');
plt.ylabel('ProsperScore');
# ProsperScore vs BankcardUtilization
plt.subplot(1, 2, 2);
sb.scatterplot(data=loan, x='BankcardUtilization', y='ProsperScore', alpha= 0.2);
plt.xlabel('BankcardUtilization');
plt.ylabel('ProsperScore');
plt.xlim(0, 1);
plt.figure(figsize=[20, 25]);
plt.suptitle('ProsperScore with Borrower personal information');
# ProsperScore vs BorrowerState
plt.subplot(4, 2, 1);
base_color = sb.color_palette()[0]
sb.boxplot(data=loan, x='BorrowerState', y='ProsperScore', color=base_color);
plt.xlabel('BorrowerState');
plt.ylabel('ProsperScore');
plt.xticks(rotation=90);
# ProsperScore vs EmploymentStatus
plt.subplot(4, 2, 2);
base_color = sb.color_palette()[0]
sb.boxplot(data=loan, x='EmploymentStatus', y='ProsperScore', color=base_color);
plt.xlabel('EmploymentStatus');
plt.ylabel('ProsperScore');
plt.xticks(rotation=10);
# ProsperScore vs IsBorrowerHomeowner
plt.subplot(4, 2, 3);
base_color = sb.color_palette()[0]
sb.boxplot(data=loan, x='IsBorrowerHomeowner', y='ProsperScore', color=base_color);
plt.xlabel('IsBorrowerHomeowner');
plt.ylabel('ProsperScore')
# ProsperScore vs OpenRevolvingMonthlyPayment
plt.subplot(4, 2, 4);
sb.scatterplot(data=loan, x='OpenRevolvingMonthlyPayment', y='ProsperScore', alpha= 0.2);
plt.xlabel('OpenRevolvingMonthlyPayment');
plt.ylabel('ProsperScore');
# ProsperScore vs TradesOpenedLast6Months
plt.subplot(4, 2, 5);
sb.scatterplot(data=loan, x='TradesOpenedLast6Months', y='ProsperScore', alpha= 0.2);
plt.xlabel('TradesOpenedLast6Months');
plt.ylabel('ProsperScore');
# ProsperScore vs MonthlyLoanPayment
plt.subplot(4, 2, 6);
sb.scatterplot(data=loan, x='MonthlyLoanPayment', y='ProsperScore', alpha= 0.2);
plt.xlabel('MonthlyLoanPayment');
plt.ylabel('ProsperScore');
# ProsperScore vs DebtToIncomeRatio
plt.subplot(4, 2, 7);
sb.scatterplot(data=loan, x='DebtToIncomeRatio', y='ProsperScore', alpha= 0.2);
plt.xlabel('DebtToIncomeRatio');
plt.ylabel('ProsperScore');
plt.xlim(0, 1);
plt.figure(figsize=[11, 11]);
plt.suptitle('ProsperScore with payment information');
# ProsperScore vs OnTimeProsperPayments
plt.subplot(2, 2, 1);
sb.scatterplot(data=loan, x='OnTimeProsperPayments', y='ProsperScore', alpha= 0.2);
plt.xlabel('OnTimeProsperPayments');
plt.ylabel('ProsperScore');
# ProsperScore vs ProsperPaymentsLessThanOneMonthLate
plt.subplot(2, 2, 2);
sb.scatterplot(data=loan, x='ProsperPaymentsLessThanOneMonthLate', y='ProsperScore', alpha= 0.2);
plt.xlabel('ProsperPaymentsLessThanOneMonthLate');
plt.ylabel('ProsperScore');
# ProsperScore vs ProsperPaymentsOneMonthPlusLate
plt.subplot(2, 2, 3);
sb.scatterplot(data=loan, x='ProsperPaymentsOneMonthPlusLate', y='ProsperScore', alpha= 0.2);
plt.xlabel('ProsperPaymentsOneMonthPlusLate');
plt.ylabel('ProsperScore');
plt.figure(figsize=[11, 11]);
plt.suptitle('Investors with Borrower credit information');
# Investors vs CreditGrade
plt.subplot(2, 2, 1);
base_color = sb.color_palette()[0]
sb.boxplot(data=df, x='CreditGrade', y='Investors', color=base_color);
plt.xlabel('CreditGrade');
plt.ylabel('Investors');
# Investors vs ProsperScore
plt.subplot(2, 2, 2);
sb.boxplot(data=loan, x='ProsperScore', y='Investors', color=base_color);
plt.xlabel('ProsperScore');
plt.ylabel('Investors');
# Investors vs TotalCreditLinespast7years
plt.subplot(2, 2, 3);
sb.scatterplot(data=loan, x='TotalCreditLinespast7years', y='Investors', alpha= 0.2);
plt.xlabel('TotalCreditLinespast7years');
plt.ylabel('Investors');
# Investors vs BankcardUtilization
plt.subplot(2, 2, 4);
sb.scatterplot(data=loan, x='BankcardUtilization', y='Investors', alpha= 0.2);
plt.xlabel('BankcardUtilization');
plt.ylabel('Investors');
plt.xlim(0, 1);
As decreasing the credite grade the number of investors decreases and increasing the prosper the number of investor increases.
plt.figure(figsize=[20, 25]);
plt.suptitle('Investors with Borrower personal information');
# Investors vs BorrowerState
plt.subplot(4, 2, 1);
base_color = sb.color_palette()[0]
sb.boxplot(data=loan, x='BorrowerState', y='Investors', color=base_color);
plt.xlabel('BorrowerState');
plt.ylabel('Investors');
plt.xticks(rotation=90);
# Investors vs EmploymentStatus
plt.subplot(4, 2, 2);
base_color = sb.color_palette()[0]
sb.boxplot(data=loan, x='EmploymentStatus', y='Investors', color=base_color);
plt.xlabel('EmploymentStatus');
plt.ylabel('Investors');
plt.xticks(rotation=10);
# Investors vs IsBorrowerHomeowner
plt.subplot(4, 2, 3);
base_color = sb.color_palette()[0]
sb.boxplot(data=loan, x='IsBorrowerHomeowner', y='Investors', color=base_color);
plt.xlabel('IsBorrowerHomeowner');
plt.ylabel('Investors')
# Investors vs OpenRevolvingMonthlyPayment
plt.subplot(4, 2, 4);
sb.scatterplot(data=loan, x='OpenRevolvingMonthlyPayment', y='Investors', alpha= 0.2);
plt.xlabel('OpenRevolvingMonthlyPayment');
plt.ylabel('Investors');
# Investors vs TradesOpenedLast6Months
plt.subplot(4, 2, 5);
sb.scatterplot(data=loan, x='TradesOpenedLast6Months', y='Investors', alpha= 0.2);
plt.xlabel('TradesOpenedLast6Months');
plt.ylabel('Investors');
# Investors vs MonthlyLoanPayment
plt.subplot(4, 2, 6);
sb.scatterplot(data=loan, x='MonthlyLoanPayment', y='Investors', alpha= 0.2);
plt.xlabel('MonthlyLoanPayment');
plt.ylabel('Investors');
# Investors vs DebtToIncomeRatio
plt.subplot(4, 2, 7);
sb.scatterplot(data=loan, x='DebtToIncomeRatio', y='Investors', alpha= 0.2);
plt.xlabel('DebtToIncomeRatio');
plt.ylabel('Investors');
plt.xlim(0, 1);
the highest investor number in the employment stast full-time and the highest number of investors in the lowest range in dent to income ratio.
plt.figure(figsize=[11, 11]);
plt.suptitle('Investors with payment information');
# Investors vs OnTimeProsperPayments
plt.subplot(2, 2, 1);
sb.scatterplot(data=loan, x='OnTimeProsperPayments', y='Investors', alpha= 0.2);
plt.xlabel('OnTimeProsperPayments');
plt.ylabel('Investors');
# Investors vs ProsperPaymentsLessThanOneMonthLate
plt.subplot(2, 2, 2);
sb.scatterplot(data=loan, x='ProsperPaymentsLessThanOneMonthLate', y='Investors', alpha= 0.2);
plt.xlabel('ProsperPaymentsLessThanOneMonthLate');
plt.ylabel('Investors');
# Investors vs ProsperPaymentsOneMonthPlusLate
plt.subplot(2, 2, 3);
sb.scatterplot(data=loan, x='ProsperPaymentsOneMonthPlusLate', y='Investors', alpha= 0.2);
plt.xlabel('ProsperPaymentsOneMonthPlusLate');
plt.ylabel('Investors');
the lowest the time in payment the higher the investor numbers.
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
#payment correlation
plt.figure(figsize = [11, 8])
num_vars = ['LoanOriginalAmount', 'BorrowerAPR', 'MonthlyLoanPayment', 'DebtToIncomeRatio','OpenRevolvingMonthlyPayment']
sb.heatmap(loan[num_vars].corr(), annot = True, fmt = '.3f', cmap = 'Blues', center = 0);
Interestingly there is no strong correlation between variables in this data set. There is some moderate positive and moderate negative correlation.
# BorrowerAPR in terms of loan amount and duration
plt.figure(figsize=[11, 6]);
duration=sb.FacetGrid(data=loan, col='Term', height=5, col_wrap=3)
duration.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
duration.add_legend();
<Figure size 792x432 with 0 Axes>
Term doesn't seem to have effect on relationship of APR and loan amount
# BorrowerAPR in terms of loan amount and duration
plt.figure(figsize=[11, 11]);
duration=sb.FacetGrid(data=df, col='CreditGrade', height=5, col_wrap=4)
duration.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
duration.add_legend();
<Figure size 792x792 with 0 Axes>
The loan amount increases with better rating. The borrower APR decreases with better rating. Interestingly, the relationship between borrower APR and loan amount turns from negative to slightly positive when the Prosper ratings are increased from HR to A or better. This may because people with A or AA ratings tend to borrow more money, increasting APR could prevent them borrow even more and maximize the profit. But people with lower ratings tend to borrow less money, decreasing APR could encourage them to borrow more.
plt.figure(figsize=[15, 11]);
plt.suptitle('ProsperScore by loan duration');
plt.subplot(2, 2, 1);
sb.pointplot(data = loan, x = 'ProsperScore', y = 'BorrowerAPR', hue = 'Term', palette = 'Blues', linestyles = '', dodge = 0.4, ci='sd')
plt.xlabel('ProsperScore');
plt.ylabel('BorrowerAPR');
plt.subplot(2, 2, 2);
sb.pointplot(data = loan, x = 'ProsperScore', y = 'DebtToIncomeRatio', hue = 'Term', palette = 'Blues', linestyles = '', dodge = 0.4, ci='sd')
plt.xlabel('ProsperScore');
plt.ylabel('DebtToIncomeRatio');
plt.subplot(2, 2, 3);
sb.pointplot(data = loan, x = 'ProsperScore', y = 'LoanOriginalAmount', hue = 'Term', palette = 'Blues', linestyles = '', dodge = 0.4, ci='sd')
plt.xlabel('ProsperScore');
plt.ylabel('LoanOriginalAmount');
plt.subplot(2, 2, 4);
sb.pointplot(data = loan, x = 'ProsperScore', y = 'Investors', hue = 'Term', palette = 'Blues', linestyles = '', dodge = 0.4, ci='sd')
plt.xlabel('ProsperScore');
plt.ylabel('Investors');
Interestingly, the borrower APR decreases with increasing Prosper rate. But for people with 4.0,7.0 ratings, the APR increase with the decrease of borrow term. Debt to income ration decreases with increasing the prosper score but nearly the sam in 2.0,3.0,4.0,5.0. Investor increases with increasing the prosper score. The loan original amount increases with terms and with prosper score.
plt.figure(figsize=[15, 6]);
plt.suptitle('LoanOriginationQuarter');
plt.subplot(1, 2, 1);
sb.countplot(data = loan, x = 'LoanOriginationQuarter', hue= 'Term', color = default_color)
plt.xticks(rotation=90);
plt.subplot(1, 2, 2);
sb.countplot(data = loan, x = 'LoanOriginationQuarter', hue= 'ProsperScore', color = default_color)
plt.xticks(rotation=90);
plt.suptitle('LoanOriginationQuarter by loan duration for ProsperScore');
g=sb.catplot(x="LoanOriginationQuarter", hue="Term", col="ProsperScore", data=loan, kind="count", height=4, col_wrap=3);
g.set_xticklabels(rotation=90);
<Figure size 432x288 with 0 Axes>
The Loan original quarter increases with moving forward in years and increases it's duration. The higher prosper score is nearly vanished in earlyer years from 2009 to 2011.
plt.figure(figsize=[15, 8]);
plt.suptitle('BorrowerAPR and EmploymentStatus for loan terms');
# BorrowerAPR and EmploymentStatus for loan 12 months duration
plt.subplot(3, 1, 1)
duration = loan.query('Term == 12')
s = duration.groupby(['BorrowerAPR', 'EmploymentStatus']).size()
s = s.reset_index(name='count')
s = s.pivot(index='BorrowerAPR', columns='EmploymentStatus', values='count')
sb.heatmap(s, cmap='rocket_r');
plt.title('12', loc='left');
plt.xlabel('EmploymentStatus');
plt.ylabel('BorrowerAPR');
# BorrowerAPR and EmploymentStatus for loan 36 months duration
plt.subplot(3, 1, 2)
duration = loan.query('Term == 36')
s = duration.groupby(['BorrowerAPR', 'EmploymentStatus']).size()
s = s.reset_index(name='count')
s = s.pivot(index='BorrowerAPR', columns='EmploymentStatus', values='count')
sb.heatmap(s, cmap='rocket_r');
plt.title('36', loc='left');
plt.xlabel('EmploymentStatus');
plt.ylabel('BorrowerAPR');
# BorrowerAPR and EmploymentStatus for loan 60 months duration
plt.subplot(3, 1, 3)
duration = loan.query('Term == 60')
s = duration.groupby(['BorrowerAPR', 'EmploymentStatus']).size()
s = s.reset_index(name='count')
s = s.pivot(index='BorrowerAPR', columns='EmploymentStatus', values='count')
sb.heatmap(s, cmap='rocket_r');
plt.title('60', loc='left');
plt.xlabel('EmploymentStatus');
plt.ylabel('BorrowerAPR');
The most distributed data in whole loan term is employed status followed by full time status.
plt.figure(figsize = [12, 8])
sb.countplot(data = loan, x = 'ProsperScore', hue = 'EmploymentStatus', palette='Blues');
middle ratings seem to have greater proportions of individuals with employment status Not Employed, Self-employed, Retired and Part-Time.
plt.figure(figsize=[15, 8]);
plt.suptitle('BorrowerAPR and ProsperRating for loan status');
# BorrowerAPR and ProsperRating for loan Current
plt.subplot(3, 1, 1)
duration = loan.query('LoanStatus == "Current"')
s = duration.groupby(['BorrowerAPR', 'ProsperScore']).size()
s = s.reset_index(name='count')
s = s.pivot(index='BorrowerAPR', columns='ProsperScore', values='count')
sb.heatmap(s, cmap='rocket_r');
plt.title('Current', loc='left');
plt.xlabel('ProsperScore');
plt.ylabel('BorrowerAPR');
# BorrowerAPR and ProsperRating for loan Completed
plt.subplot(3, 1, 2)
duration = loan.query('LoanStatus == "Completed"')
s = duration.groupby(['BorrowerAPR', 'ProsperScore']).size()
s = s.reset_index(name='count')
s = s.pivot(index='BorrowerAPR', columns='ProsperScore', values='count')
sb.heatmap(s, cmap='rocket_r');
plt.title('Completed', loc='left');
plt.xlabel('ProsperScore');
plt.ylabel('BorrowerAPR');
# BorrowerAPR and ProsperRating for loan Chargedoff
plt.subplot(3, 1, 3)
duration = loan.query('LoanStatus == "Chargedoff"')
s = duration.groupby(['BorrowerAPR', 'ProsperScore']).size()
s = s.reset_index(name='count')
s = s.pivot(index='BorrowerAPR', columns='ProsperScore', values='count')
sb.heatmap(s, cmap='rocket_r');
plt.title('Chargedoff', loc='left');
plt.xlabel('ProsperScore');
plt.ylabel('BorrowerAPR');
Our initial assumptions were strengthened. The outcome of credit depends on Prosper raing, Term, Employment status. Defaulted credits tend to be larger than completed for all ratings except the lowest ones. In terms of purposes of credits more default prone are other and business categories (business category also tend to have larger loans). Long term (60 months) loans are riskier than mid-term and short term.
I extended my investigation of borrower APR against loan amount by looking at the impact of the Prosper rating. The multivariate exploration showed that the relationship between borrower APR and loan amount turns from negative to slightly positive when the Prosper ratings increased from HR to AA. I then explored the rating and term effects on loan amount, it shows that with better Prosper rating, the loan amount of all three terms increases, the increase amplitude of loan amount between terms also becomes larger.
A surprising interaction is that the borrower APR and loan amount is negatively correlated when the Prosper ratings are from HR to B, but the correlation is turned to be positive when the ratings are A and AA.
Interesting find was that defaulted credits for individuals with high Prosper ratings tend to be larger than completed credits. Another interesting find that individuals with lowerst rating (HR) have only mid-term (36 months) credits
At the end of your report, make sure that you export the notebook as an html file from the
File > Download as... > HTMLmenu. Make sure you keep track of where the exported file goes, so you can put it in the same folder as this notebook for project submission. Also, make sure you remove all of the quote-formatted guide notes like this one before you finish your report!